[Day14] - Datatype: pl.Enum and pl.Categorical

2025 iThome Ironman Contest

DAY 14

Software Development

Polars熊霸天下 series, part 14

Tags: 17th Ironman Contest, python, polars

Jerry Wu

2025-09-20 00:03:08


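Important note: pl.Categorical underwent a major change in v1.32.0. This post describes the new behavior (as of v1.33.1).

Today we look at when to use the two types pl.Enum and pl.Categorical.

Both types are designed for pl.String data that takes a limited set of values, so anything that can be enumerated, such as the four seasons or the months of the year, is a good fit. With these types, Polars does not have to store the full string in every row; it stores an integer code per row and looks the string up through a code-to-string mapping, which can greatly reduce storage and speed up access.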

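pl.Enum fits cases where every possible category is known in advance, and it is the simpler of the two to use; pl.Categorical fits cases where the categories cannot be known in advance.

Today's examples use three common operating systems. Their order was chosen arbitrarily and has no special meaning.

Outline for today:

0. Imports and setup
1. Important changes to pl.Categorical
2. pl.Enum
3. pl.Categorical
4. codepanda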

0. Imports and setup

import polars as pl

os_data = ["macOS", "Linux", "Windows"]

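1. Important changes to `pl.Categorical`

pl.Categorical underwent a major change in v1.32.0, with two main consequences.

`ordering=` parameter

ordering= will now always be "lexical"; "physical" has been retired.

"physical" was the previous default. It ordered categories by the row in which each one first appeared: a category first seen in an earlier row sorted lower, and one first seen in a later row sorted higher.
"lexical" compares strings by Unicode code point, which is what Python's ord() function returns for a character. The commonly used ranges are the digits 0-9 (48 to 57), the uppercase letters A-Z (65 to 90), and the lowercase letters a-z (97 to 122), so every uppercase letter sorts before every lowercase letter.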
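The code point ranges above can be checked with plain Python, whose string comparison also works by code point. This is only a sketch of the ordering rule; it does not call Polars.

```python
os_data = ["macOS", "Linux", "Windows"]

# Code points at the edges of the three commonly used ranges
boundaries = {ch: ord(ch) for ch in "09AZaz"}
print(boundaries)

# Python compares strings by code point, so sorted() gives the lexical order
lexical_order = sorted(os_data)
print(lexical_order)
```

Because "m" (109) is a lowercase letter, "macOS" sorts after both "Linux" and "Windows".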

String cache

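According to a LinkedIn post by Ritchie Vink (the creator of Polars), users may eventually no longer need to care about the troublesome string cache, but the user guide has not been updated yet.

The following example comes from the current user guide: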

from polars.exceptions import StringCacheMismatchError


bears_cat = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
bears_cat2 = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
)

try:
    print(bears_cat == bears_cat2)
except StringCacheMismatchError as exc:
    exc_str = str(exc).splitlines()[0]
    print("StringCacheMismatchError:", exc_str)

StringCacheMismatchError: cannot compare categoricals coming from different sources, consider setting a global StringCache.

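This error message means that when pl.Categorical data is created separately (that is, each piece is encoded on its own), Polars cannot operate on the pieces together. There are two ways to solve this:

Use pl.enable_string_cache(): calling pl.enable_string_cache() before using pl.Categorical turns on the string cache globally. The official documentation warns that this is a very inefficient solution (even though Rust is fast...) and recommends it only when necessary.

Use pl.StringCache(): pl.StringCache() can be used as a context manager or as a decorator. Within that local scope, Polars can encode separately created pl.Categorical data efficiently.

pl.StringCache() as a context manager: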

with pl.StringCache():
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat == bears_cat2)

pl.StringCache() as a decorator:

@pl.StringCache()
def compare_bears() -> pl.Series:
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

    # Return the element-wise comparison so the function matches its annotation
    return bears_cat == bears_cat2

print(compare_bears())

2. `pl.Enum`

The simplest way to create a pl.Enum is to pass an iterable of category names, such as a list:

enum_order = ["Linux", "macOS", "Windows"]
# "Linux" < "macOS" < "Windows"
common_os_enum = pl.Enum(enum_order)

common_os_enum is now a pl.Enum type, and its values can be compared by order ("Linux" < "macOS" < "Windows").

Next we build a dataframe named common_os_enum_df with three columns, "os", "os2" and "os3". The "os" and "os2" columns are pl.Enum, and the "os3" column is pl.String:

common_os_enum_df = (
    pl.DataFrame(
        {"os": os_data},
        schema={"os": common_os_enum},
    )
    .with_columns(pl.col("os").shuffle(seed=42).alias("os2"))
    .with_columns(pl.col("os2").cast(pl.String).alias("os3"))
)

shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ os      ┆ os2     ┆ os3     │
│ ---     ┆ ---     ┆ ---     │
│ enum    ┆ enum    ┆ str     │
╞═════════╪═════════╪═════════╡
│ macOS   ┆ Windows ┆ Windows │
│ Linux   ┆ Linux   ┆ Linux   │
│ Windows ┆ macOS   ┆ macOS   │
└─────────┴─────────┴─────────┘

Note that if the "os" or "os2" column contained a value that is not an element of enum_order, Polars would raise InvalidOperationError.

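Comparing with a string

A pl.Enum column can be compared with a string, but the string must be one of the Enum's categories; otherwise Polars raises InvalidOperationError. For example, we use pl.DataFrame.filter() to keep the rows whose "os" value is greater than the string "macOS":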

# "Linux" < "macOS" < "Windows"
common_os_enum_df.filter(pl.col("os").gt("macOS"))

shape: (1, 3)
┌─────────┬───────┬───────┐
│ os      ┆ os2   ┆ os3   │
│ ---     ┆ ---   ┆ ---   │
│ enum    ┆ enum  ┆ str   │
╞═════════╪═══════╪═══════╡
│ Windows ┆ macOS ┆ macOS │
└─────────┴───────┴───────┘

Because common_os_enum is ordered "Linux" < "macOS" < "Windows":

"macOS" > "macOS"? => False
"Linux" > "macOS"? => False
"Windows" > "macOS"? => True

Only the last row satisfies the filter.

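Comparing with the `pl.String` type

A pl.Enum column can be compared with a pl.String column, but every string in the pl.String column must be one of the Enum's categories; otherwise Polars raises InvalidOperationError. For example, we can compute whether the "os" column (pl.Enum) is greater than the "os3" column (pl.String):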

# "Linux" < "macOS" < "Windows"
(
    common_os_enum_df.with_columns(
        pl.col("os").gt(pl.col("os3")).alias("os > os3")
    )
)

shape: (3, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ os      ┆ os2     ┆ os3     ┆ os > os3 │
│ ---     ┆ ---     ┆ ---     ┆ ---      │
│ enum    ┆ enum    ┆ str     ┆ bool     │
╞═════════╪═════════╪═════════╪══════════╡
│ macOS   ┆ Windows ┆ Windows ┆ false    │
│ Linux   ┆ Linux   ┆ Linux   ┆ false    │
│ Windows ┆ macOS   ┆ macOS   ┆ true     │
└─────────┴─────────┴─────────┴──────────┘

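Because common_os_enum is ordered "Linux" < "macOS" < "Windows":

"macOS" > "Windows"? => False
"Linux" > "Linux"? => False
"Windows" > "macOS"? => True

Comparing with the `pl.Enum` type

A pl.Enum column can be compared with another pl.Enum column, but both must be built from the same categories; otherwise Polars raises InvalidOperationError. For example, we can compute whether the "os" column (pl.Enum) is greater than the "os2" column (pl.Enum):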

# "Linux" < "macOS" < "Windows"
(
    common_os_enum_df.with_columns(
        pl.col("os").gt(pl.col("os2")).alias("os > os2")
    )
)

shape: (3, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ os      ┆ os2     ┆ os3     ┆ os > os2 │
│ ---     ┆ ---     ┆ ---     ┆ ---      │
│ enum    ┆ enum    ┆ str     ┆ bool     │
╞═════════╪═════════╪═════════╪══════════╡
│ macOS   ┆ Windows ┆ Windows ┆ false    │
│ Linux   ┆ Linux   ┆ Linux   ┆ false    │
│ Windows ┆ macOS   ┆ macOS   ┆ true     │
└─────────┴─────────┴─────────┴──────────┘

Because common_os_enum is ordered "Linux" < "macOS" < "Windows":

"macOS" > "Windows"? => False
"Linux" > "Linux"? => False
"Windows" > "macOS"? => True

3. `pl.Categorical`

Next we build a dataframe named common_os_cat_df with three columns, "os", "os2" and "os3". The "os" and "os2" columns are pl.Categorical, and the "os3" column is pl.String:

common_os_cat_df = (
    pl.DataFrame({"os": os_data}, schema={"os": pl.Categorical()})
    .with_columns(pl.col("os").shuffle(seed=42).alias("os2"))
    .with_columns(pl.col("os2").cast(pl.String).alias("os3"))
)

shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ os      ┆ os2     ┆ os3     │
│ ---     ┆ ---     ┆ ---     │
│ cat     ┆ cat     ┆ str     │
╞═════════╪═════════╪═════════╡
│ macOS   ┆ Windows ┆ Windows │
│ Linux   ┆ Linux   ┆ Linux   │
│ Windows ┆ macOS   ┆ macOS   │
└─────────┴─────────┴─────────┘

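Comparing with a string

A pl.Categorical column can be compared with a string. For example, we use pl.DataFrame.filter() to keep the rows whose "os" value is greater than the string "Windows":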

# ord("L")=76, ord("W")=87, ord("m")=109,
# "Linux" < "Windows" < "macOS"
common_os_cat_df.filter(pl.col("os").gt("Windows"))

shape: (1, 3)
┌───────┬─────────┬─────────┐
│ os    ┆ os2     ┆ os3     │
│ ---   ┆ ---     ┆ ---     │
│ cat   ┆ cat     ┆ str     │
╞═══════╪═════════╪═════════╡
│ macOS ┆ Windows ┆ Windows │
└───────┴─────────┴─────────┘

Every value starts with a different letter, so it is enough to look at ord() of the first letter. Because the order is "Linux" < "Windows" < "macOS":

"macOS" > "Windows"? => True
"Linux" > "Windows"? => False
"Windows" > "Windows"? => False

Only the first row satisfies the filter.

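Comparing with the `pl.String` type

A pl.Categorical column can be compared with a pl.String column. For example, we can compute whether the "os" column (pl.Categorical) is greater than the "os3" column (pl.String):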

# ord("L")=76, ord("W")=87, ord("m")=109,
# "Linux" < "Windows" < "macOS"
(
    common_os_cat_df.with_columns(
        pl.col("os").gt(pl.col("os3")).alias("os > os3"),
    )
)

shape: (3, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ os      ┆ os2     ┆ os3     ┆ os > os3 │
│ ---     ┆ ---     ┆ ---     ┆ ---      │
│ cat     ┆ cat     ┆ str     ┆ bool     │
╞═════════╪═════════╪═════════╪══════════╡
│ macOS   ┆ Windows ┆ Windows ┆ true     │
│ Linux   ┆ Linux   ┆ Linux   ┆ false    │
│ Windows ┆ macOS   ┆ macOS   ┆ false    │
└─────────┴─────────┴─────────┴──────────┘

Because the order is "Linux" < "Windows" < "macOS":

"macOS" > "Windows"? => True
"Linux" > "Linux"? => False
"Windows" > "macOS"? => False

Comparing with the `pl.Categorical` type

A pl.Categorical column can be compared with another pl.Categorical column. For example, we can compute whether the "os" column (pl.Categorical) is greater than the "os2" column (pl.Categorical):

# ord("L")=76, ord("W")=87, ord("m")=109,
# "Linux" < "Windows" < "macOS"
(
    common_os_cat_df.with_columns(
        pl.col("os").gt(pl.col("os2")).alias("os > os2"),
    )
)

shape: (3, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ os      ┆ os2     ┆ os3     ┆ os > os2 │
│ ---     ┆ ---     ┆ ---     ┆ ---      │
│ cat     ┆ cat     ┆ str     ┆ bool     │
╞═════════╪═════════╪═════════╪══════════╡
│ macOS   ┆ Windows ┆ Windows ┆ true     │
│ Linux   ┆ Linux   ┆ Linux   ┆ false    │
│ Windows ┆ macOS   ┆ macOS   ┆ false    │
└─────────┴─────────┴─────────┴──────────┘

Because the order is "Linux" < "Windows" < "macOS":

"macOS" > "Windows"? => True
"Linux" > "Linux"? => False
"Windows" > "macOS"? => False

`pl.Expr.cat` namespace

The pl.Expr.cat namespace provides a small number of expressions.

Here we show how pl.Expr.cat.get_categories() returns the categories of a pl.Categorical column:

common_os_cat_df.select(pl.col("os").cat.get_categories())

shape: (3, 1)
┌─────────┐
│ os      │
│ ---     │
│ str     │
╞═════════╡
│ macOS   │
│ Linux   │
│ Windows │
└─────────┘

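The result looks the same as selecting the "os" column, but only because the "os" column happens to contain exactly these three values.

Also, the sharp-eyed reader may have noticed that the returned "os" column has the pl.String type.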

4. `codepanda`

In pandas, the counterpart of Polars' pl.Enum and pl.Categorical is pd.CategoricalDtype.

pd.CategoricalDtype comes in unordered and ordered forms, controlled by its ordered= parameter.

import pandas as pd

os_data_pd = ["Linux", "macOS", "Windows"]
os_cat_non_ordered = pd.CategoricalDtype(categories=os_data_pd)
os_cat_ordered = pd.CategoricalDtype(categories=os_data_pd, ordered=True)

# Add an unordered and an ordered categorical copy of the "os" column
df_pd = pd.DataFrame({"os": os_data_pd}).assign(
    os_cat_non_ordered=lambda df_: df_["os"].astype(os_cat_non_ordered),
    os_cat_ordered=lambda df_: df_["os"].astype(os_cat_ordered),
)

        os os_cat_non_ordered os_cat_ordered
0    Linux              Linux          Linux
1    macOS              macOS          macOS
2  Windows            Windows        Windows

Comparing an unordered pd.CategoricalDtype column with one of its categories raises an error:

❌
# TypeError: Unordered Categoricals can only compare equality or not
df_pd.query("os_cat_non_ordered > 'macOS'")

An ordered pd.CategoricalDtype column, by contrast, can be compared with one of its categories, for example:

df_pd.query("os_cat_ordered > 'macOS'")

        os os_cat_non_ordered os_cat_ordered
2  Windows            Windows        Windows
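The ordered and unordered behavior can also be seen on a single pandas Series. This is a minimal sketch; the variable names are illustrative and not taken from the original post.

```python
import pandas as pd

categories = ["Linux", "macOS", "Windows"]
ordered_dtype = pd.CategoricalDtype(categories, ordered=True)
unordered_dtype = pd.CategoricalDtype(categories)

s_ordered = pd.Series(["macOS", "Linux", "Windows"], dtype=ordered_dtype)
s_unordered = s_ordered.astype(unordered_dtype)

# Integer codes are positions in the category list
codes = s_ordered.cat.codes.tolist()
print(codes)

# Ordered categoricals support inequality comparisons
mask = (s_ordered > "macOS").tolist()
print(mask)

# Unordered categoricals only support equality checks
try:
    s_unordered > "macOS"
    unordered_raises = False
except TypeError:
    unordered_raises = True
print(unordered_raises)
```

As with pl.Enum, the order comes from the position of each category in the list passed to pd.CategoricalDtype.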

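Notes

Note 1: When a pl.Categorical column is compared with a string or a pl.String column, a value that is not among the column's existing categories does not raise an error; the comparison still works. In the following example we check whether the "os" column is greater than the string "A":

# ord("A")=65, ord("L")=76, ord("W")=87, ord("m")=109,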
# "A" < "Linux" < "Windows" < "macOS"
df0 = pl.DataFrame({"os": os_data}, schema={"os": pl.Categorical()})
print(df0.with_columns(pl.col("os").gt("A").alias("> A")))

shape: (3, 2)
┌─────────┬──────┐
│ os      ┆ > A  │
│ ---     ┆ ---  │
│ cat     ┆ bool │
╞═════════╪══════╡
│ macOS   ┆ true │
│ Linux   ┆ true │
│ Windows ┆ true │
└─────────┴──────┘

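Although the string "A" is not among the categories of the pl.Categorical column, it can still be compared with the "os" column. Every value starts with a letter whose ord() is greater than that of "A" (65), so every comparison is True.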

Code

Link to today's code.
